Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Some information about the data set

(The text below is copied from the HAR website.)

Six young health participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).

Read more: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har#ixzz5GisaTutu

Data Loading, Preparation, and Visualisation

f1 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
f2 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
pml_training.csv <- read.csv(file = f1, header = TRUE )
pml_testing.csv <- read.csv(file = f2, header = TRUE )
rm(f1, f2)
## remove columns with subject variables (the first 7)
training <- pml_training.csv[ , -c(1:7) ]
testing <- pml_testing.csv[ , -c(1:7) ]

## remove columns with NAs
NA_training <- which(colSums(is.na(training)) == 0)
NA_testing <-  which(colSums(is.na(testing)) == 0)

# There are more columns with NAs in the test set, hence we take this list
# to clean both train and test set
training <- training[, NA_testing]
testing <- testing[, NA_testing]

A quick visualisation

For people unfamiliar with this visualisation: you can click on a cell in the heatmap and then the scatterplot will reveal the variables that reference the cell clicked.

# names(grep("factor", sapply(testing, class), value=TRUE))
# names(grep("factor", sapply(training, class), value=TRUE))

### cor heatmap + scatterplot of selected
iplotCorr(training  %>% select(-classe), reorder = TRUE)
## Set screen size to height=700 x width=1000

Model Building

Trees

fit_tree <- train(classe ~ . , method = "rpart", data = training)
rpart.plot(fit_tree$finalModel)
confusionMatrix(fit_tree)
## Bootstrapped (25 reps) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction    A    B    C    D    E
##          A 25.4  9.0  8.4  7.7  3.3
##          B  0.7  6.0  0.7  2.7  2.4
##          C  1.9  3.2  7.4  3.9  3.3
##          D  0.3  0.8  0.6  1.7  0.6
##          E  0.2  0.4  0.3  0.3  8.8
##                             
##  Accuracy (average) : 0.4941
rpart.plot(fit_tree$finalModel)

RandomForest with parallel processing

Prepare for parallel processing

library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
library(parallel)
detectCores(); detectCores(logical = FALSE)
## [1] 4
## [1] 2

Build RandomForest Model

cluster <- makeCluster(detectCores()-1) # convention to leave 1 core for OS
registerDoParallel(cluster)

fitControl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)

fit_rf <- train(classe ~ . , method = "rf", data = training, trControl = fitControl)

stopCluster(cluster)
registerDoSEQ()

confusionMatrix(fit_rf$finalModel$predicted, training$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 5574   17    0    0    0
##          B    4 3774    8    0    1
##          C    1    5 3403   23    6
##          D    0    1   11 3190    6
##          E    1    0    0    3 3594
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9956          
##                  95% CI : (0.9945, 0.9964)
##     No Information Rate : 0.2844          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9944          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9989   0.9939   0.9944   0.9919   0.9964
## Specificity            0.9988   0.9992   0.9978   0.9989   0.9998
## Pos Pred Value         0.9970   0.9966   0.9898   0.9944   0.9989
## Neg Pred Value         0.9996   0.9985   0.9988   0.9984   0.9992
## Prevalence             0.2844   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2841   0.1923   0.1734   0.1626   0.1832
## Detection Prevalence   0.2849   0.1930   0.1752   0.1635   0.1834
## Balanced Accuracy      0.9989   0.9966   0.9961   0.9954   0.9981

Performance of models

  1. The accuracy of the Tree model was 0.498
  2. The accuracy of the Random Forest Model was 0.996 We continue with the latter model to do predictions

The out of sample error for the Random Forest Model is: 0.4 percent

Predict with the best model: Random Forest Based

# print(fit_rf$finalModel)

pred_rf <- predict(fit_rf, newdata = testing)
print(pred_rf)
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Conclusion

Several models were tested, two were selected for reporting as they provide insight. The tree model has limited accuracy, while the random forest model is very good in predicting.